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Abstract 

The Mixture Transition Distribution (MTD) model was introduced 
by Raftery to face the need for parsimony in the modeling of high-order 
Markov chains in discrete time. The particularity of this model comes 
from the fact that the effect of each lag upon the present is considered 
separately and additively, so that the number of parameters required is 
drastically reduced. However, the efficiency for the MTD parameter es- 
timations proposed up to date still remains problematic on account of 
the large number of constraints on the parameters. In this paper, an it- 
erative procedure, commonly known as Expectation-Maximization (EM) 
algorithm, is developed cooperating with the principle of Maximum Like- 
lihood Estimation (MLE) to estimate the MTD parameters. Some ap- 
plications of modeling MTD show the proposed EM algorithm is easier 
to be used than the algorithm developed by Berchtold. Moreover, the 
EM Estimations of parameters for high-order MTD models led on DNA 
sequences outperform the corresponding fully parametrized Markov chain 
in terms of Bayesian Information Criterion. 

A software implementation of our algorithm is available in the library 
seqH — h at http : / / stat . genopole . cnrs . f r/ seqpp. 

keywords: Markov chain; mixture transition distribution (MTD); Par- 
simony; Maximum likelihood; EM algorithm; 



1 Introduction 

While providing a useful framework for discrete-time sequence mod- 
eling, higher-order Markov chains suffer from the exponential growth 
of the parameter space dimension with respect to the order of the 
model, which results in the inefficiency of the parameters'estimations 
when a limited amount of data is available. This fact motivates 
the developments of approximate versions of higher-order Markov 
chains, such as the Mixture Transition Distribution (MTD) model 
[ill . 0] and variable length Markov chains Thanks to a simple 
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structure, where each lag contributes to the prediction of the cur- 
rent letter in a separate and additive way, the dimension of model 
parameter space grows only linearly with respect to the order of the 
MTD model. 

Nevertheless, Maximum Likelihood Estimation (MLE) in the MTD 
model is subject to such constraints that analytical solutions are be- 
yond the reach of present methods. One has thus to retort to numer- 
ical optimization procedures. The most powerful method proposed 
to this day is due to Berchtold [2j, and relies on an ad-hoc opti- 
mization method. In this paper, we propose to fit the MTD model 
into the general framework of hidden variable models, and derive 
a version of the classical EM algorithm for the estimations of its 
parameters. 

In this first section, we define the MTD model and recall its main 
features and some of its variants. Parametrization of the model 
is discussed in section 2, where we establish that under the most 
general definition, it is not identifiable. Then we shed light on an 
identifiable set of parameters. Derivations of the update formulas 
involved by the EM algorithm are detailed in section 3. We finally 
illustrate our method by some applications to biological sequence 
modeling. 

Need for parsimony Markov models are pertinent to analyze m- 
letter words' composition of a sequence of random variables 0, 0]. 
Nevertheless, the length m of the words the model accounts for has 
to be chosen by the statistician. On the one hand, a high order 
is always preferred since it can capture strictly more information. 
On the other hand, since the parameter's dimension increases expo- 
nentially fast with respect to the order of the model, higher order 
models cannot be accurately estimated. Thus, a trade-off has to be 
drawn to optimize the amount of information extracted from the 
data. 

We illustrate this phenomenon by running a simple experiment: 
by using a randomly chosen Markov chain transition matrix of order 
5, we sample 1000 sequences of length 5000. Each of them is then 
used to estimate a Markov model transition matrix of order varying 
from 2 to 6. For each of these estimates, we have plotted the total 
variation distance with respect to the generating model (see Figure 
[T]), computed as the quantity D VT (P,Q) = J2 x ey n \P( X ) ~ Qi x )\ 
for distributions P and Q. It turns out that the optimal estimation 
in terms of total variation distance between genuine and estimated 
distributions is obtained with a model of order 2 whereas the gen- 
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Figure 1: Total variation distance between distributions estimated from ran- 
domly generated sequences and the generating distribution. The generating 
model is of order 5, and the random sequences are 5000 letters long. 

erating model is of order 5. 

Mixture Transition Distributions aim at providing a model ac- 
counting for the number of occurrences of ra-letter words, while 
avoiding the exponential increase with respect to m of the full Markov 
model parameter's dimension (See Table Q] for a comparison of the 
models' dimensions). 

MTD modeling Let Y = (Y\, . . . ,Y n ) be a sequence of random 
variables taking values in the finite set y — {1, . . . , g}. We use the 
notation, 

Y t l 2 = (Y tl ,Y tl+ i, . . . , Y t2 ) 

to refer to the subsequence of the t2—t\ + l successive variables. In 
the whole paper, vectors and matrices are denoted by bold letters. 

Definition 1 The random sequence Y is said to be an m th order 
MTD sequence if 

m 

Vt>my yi ,...,y t ey, ¥{Y t = y t \Yl- 1 = y\- 1 ) = J> 9 ¥(Y t = y t \Y t . g = y t - 9 ) 

9=1 
m 

= ^2<Pgng{yt-g,yt)- (1) 
9=1 
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where the vector cp = (ipi, . . . , tp m ) is subject to the constraints: 

Vge{l,...,m}> <p g >0, (2) 



£ 

9=1 



1. (3) 



and the matrices {Tz g = \P(Y t — j\Yt- g = ^)]ij E y ^ < 9 < m } are 
q x q stochastic matrices. 

A mth-order MTD model is thus denned by a vector parameter, 



L<g<m 



which belongs to the space 




Vz,j G y, < 7r g (i,j) < land ^2^ g (i,j) = 

jay 

It is obvious from the first equality in equation ([T]) that the MTD 
model fulfills the Markov property Thus, MTD models are Markov 
models with the particularity that each lag Yt-i, Yt-2, ■ ■ ■ contributes 
additively to the distribution of the random variable Y t . Berchtold 
and Raftery |3j published a complete review of the MTD model. 
They recall theoretical results on the limiting behavior of the model 
and on its auto-correlation structure. Details are given about sev- 
eral extensions of this model, such as infinite-lag models, or infinite 
countable and continuous state space. 

We have to point out that Raftery [ll| defined the original model 
with the same transition matrix 7r for each lag {^t- s } s =i,..,, m - 111 
the sequel, we refer to this model as the single matrix MTD model. 
Later, Berchtold [l[ introduced a more general definition of the MTD 
models as a mixture of transitions from different subsets of lagged 
variables {Y t - m , . . . , Y t -i\ to the present one Y t , eventually discard- 
ing some of the dependencies. In this paper, we focus on a slightly 
more restricted model having a specific but same order transition 
matrix 7r p for each lag Y t _ g . We denote by MTD; the MTD model 
which has a Z-order transition matrix for each lag (Definition [2]). 
From now on, the MTD model defined by ([T]) is denoted accord- 
ingly by MTDi. 
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Definition 2 The random sequence Y is a m-order MTD\ sequence 
if, for all l,m eN such that I < m, and all y\ G y l : 

F(Y t = y t \Yt x = yt- 1 ) = F(Y t = itlY^ = ytz'J 

m—l+l 

9=1 
m-l+1 

= to Kgiytg-i+vVt)- 

9=1 

holds, where ir g is a q l x q transition matrix. 

Trade-off between dimension and maximal likelihood Even though 
MTD models involve a restricted amount of parameters compared 
to Markov chains, increasing the order I of the model may result 
in efficiency of the MLE decreased. The quality of the trade-off 
between goodness-of-fit and generalization error a model achieves 
can be assessed against classical model selection criteria, such as 
the Bayesian Information Criterion (see illustrations in section fl~2l) . 

However, computing the BIC requires the knowledge of the di- 
mension of the model. This dimension is usually computed as 
the dimension of the parameter space for a bijective parametriza- 
tion. In the specific case of the MTD models, the original single- 
matrix model is parametrized in a bijective way, whereas its gener- 
alized version with specific transition matrices for each lag is over- 
parametrized: in appendix |X] is given an example of two distinct 
values of the parameters (ip, 7r), which both define the same MTD! 
distribution. The dimension of the model is thus lower than the 
dimension of the parameter space, and computing the BIC using 
the parameter space dimension would over-penalize the models. A 
tighter upper bound of the dimension of the MTD; model is derived 
in section [2l a bound which is used later to compute the BIC. 

The question of estimation As a counterpart for their parsimony, 
MTD parameters are difficult to be estimated due to the constraints 
that the transition probabilities {f(i m . . . i x \ io); i m , . . . , io £ y} have 
to comply to. There is indeed no analytical solution to the maxi- 
mization of the log-likelihood L y (0) = F (Y = y) of the MTD 
models under the constraints the vector ip and the stochastic matri- 
ces Tz g have to fulfill. For a given sequence y — yx, . . . ,y n of length 
n, we recall that the loglikelihood of the sequence y under the MTDi 



■5 



model writes 



L y (8) = logP e (17 = tf) 

{n / m 

F(Y™ = y?) JJ ^tt^,^) 
t=m+l \9=1 

The estimation of the original single matrix MTD model already 
aroused a lot of interest. Although any distribution from this model 
is defined by a unique parameter d, the maximum likelihood can 



not be analytically determined. Li and Kwok [10] propose an inter- 
esting alternative to the maximum likelihood with a minimum chi- 
square method. Nevertheless, they carry out estimations by using 
a non-linear optimization algorithm that is not explicitly described. 
Raftery and Tavare [12|] obtain approximations of both maximum 
likelihood and minimum chi-square estimates with numerical proce- 
dures from the NAG library which is not freely available. They also 
show that the MTD model can be estimated using GLIM (Gener- 
alized Linear Interactive Modeling) in the specific case where the 
state space's size q equals 2. Finally, Berchtold [3] developed an ad 
hoc iterative method implementing a constrained gradient descent 
optimization. This algorithm is based on the assumption that the 
vector cp and each row of the matrix 7r are independent. It consists 
in successively updating each of these vectors constrained to have a 
sum of components equal to 1 as follows. 



Berchtold's Algorithm 

• Compute partial derivatives of the log likelihood according to 
each element of the vector, 

• choose a value 5 in [0, 1], 

• add 5 to the the component with the largest derivative, and 
subtract 5 from the one with the smallest derivative. 

This algorithm has been shown to perform at least better than the 
previous methods, and it can be extended to the case of the MTD/ 
models. In this latter case, it estimates one of the parameter vec- 
tors {(tp g , 7r 5 ); 1 < g < m} describing the maximum-likelihood MTD 
distribution. Nevertheless, the choice of the alteration parameter S 
remains an issue of the method. An in-depth discussion of the strat- 
egy used to update the alteration parameter 5 can be found in |2j. 

We propose to approximate the maximum likelihood estimate of 

the MTD model < Pml(« w . . . i\\ zq); i m , ■ ■ ■ , io G y \ by coming down 
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to a better known problem: estimation of incomplete data with an 
Expectation-Maximization (EM) algorithm [5[ . We introduce a sim- 
ple estimation method which allows to approximate one parameter 
vector 9 = {(<p g , 7r 9 ); 1 < g < m} maximizing the log-likelihood. 

2 Upper bound of the MTD model dimension 

The MTDi model is over-parametrized. We provide an example 
of two distinct parameter values (ip, 7r) defining the same 2 nd -order 
MTD X model in appendix^] Moreover, we propose a new parameter 
set whose dimension is lower. It stems from the straightforward re- 
mark that the mth-order MTDi model satisfies the following propo- 
sition: 

Proposition 1 Transition probabilities of a mth-order MTD\ model satisfy: 
Vz m , i g , io, i' g E y, 

F(i m ...i g ...i 1 ; i ) - F(i m ...i' g ...ii; i ) = ip g [ir g (i g , io) - n g (i g , io)] ■ 

(4) 

This simply means that the left-hand side of equation (j4j) only de- 
pends on the parameter components associated to lag g. 

Consider a given distribution from MTDx with parameter (tp g , ^ g )i< g < m , 
and let u be an arbitrary element of y. Each transition probability 
F(i m ...ix] io) may be written : 

m m 

P(z m ..ii;z'o) = ^2p g [^ g (i g ,io) ~ ^g(u,i )] + ^2(p g n g (u,io). (5) 
s=i 9=1 

From Propositiond], it follows that each term of the first sum ip g [n g {i g , io) — ^g(u, io)} 
equals the difference of probabilities F(u...ui g u...u; i ) — F(u...u; i ). 
The second sum is trivially the transition probability from the m- 
letter word u . . . u to io- 

Let us denote the transition probabilities from m-letter words to 
the letter j, restricting to words differing from u . . .u by at most 
one letter : 

Pu(g ;i,j) :=F(u...uiu...u;j), (6) 

where u...uiu...u is the m-letter word whose letter in position g (from 
right to left) is i. The quantities in ([6]) are sufficient to describe the 
model, as stated in the following proposition. 
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Proposition 2 The transition probabilities of a mth-order MTD\ model satisfy: 



Vu g y, Vi m , ...,i g , ...,« e 

P(i m , ii; i ) = ]T™ =1 [Pu(g ; z fl , io) - ^ Pu(«o)] • 

where p u (j) denotes the value of p u (g;u, j) , whatever the value of g. 

For any arbitrary u element of y, a MTDx distribution can be 
parametrized by a vector 6 U from the (q— l)[l+m(g— l)]-dimensional 

set e„, 



©u = S ((Pu(g;i,j))i< g <m,i,jey such that V# G {1, . . . ,m},Vz G ^, 



2_^Pu(9',i,j) = 1 and Vc/,^' G {1, . . . ,m},p u (g;u,j) = p u {g';u,j) 



(7) 



Note that not all U in Q u define a MTDi distribution: the sum 
Y^jLi Pu(g\ ig, io) ~ ^m^-Puiio) ma y indeed fall outside the interval 
[0, 1]. For this reason, we can only claim that some subset M of 6„ 
is a parameter space for the MTDx model. However, as the com- 
ponents of a parameter 8 U G U are transition probabilities, two 
different parameter values can not define the same MTD distribu- 
tion. The mapping of Q u on the MTDi model is thus bijective, 
which results in the dimension of G u being an upper bound of the 
dimension of the MTD model. 

Whereas the original definition of the MTDi model (0Q) involves 
an m— l+mq(q— l)-dimensional parameter set, this new parametriza- 
tion lies in a smaller dimensional space, dropping q(m — 1) param- 
eters. 

Equivalent parametrization can be set for MTD models having 
higher order transition matrix for each lag. For any I > 1, a MTD; 
model can be described by a vector composed of the transition prob- 
abilities p l u (g; j) = ¥(u...uii...iiu,..u; j) for all /-letter words 
%i...i\. Denoting by 0^ the corresponding parameter space, its di- 
mension |6(J = J2k=2l ( l k ~ 2 (. < l~ l) 3 (m — k + l)] + (l + m(q— l))(q — 1) 
is again much smaller than the number of parameters originally re- 
quired to describe the MTD/ model (see jsj, section 2.2, for the 
counting details). A comparison of the dimensions according to 
both parametrizations appears in Table [U We will now make use of 
the upper bound \0 l u \ of the model's dimension to penalize the like- 
lihood in the assessment of MTD models goodness-of-fit (see section 
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Table 1: Number of independent parameters required to describe full 
Markov and MTDj models (state space size: q = 4). Except for the single 
matrix MTD model, MTD models originally denned with parameters ((p., it) are 
over parametrized: the parameter l u , introduced in section [21 requires far less 
independent parameters. Note that the 1 st order MTDi model (resp. 2 nd order 
MTD2 model) is equivalent to the I s * order (resp. 2 nd order) full Markov model. 





Full 


MTDi 


MTD 2 


Order m 


Markov 


l(wr)| 








1 


12 


12 


12 






2 


48 


25 


21 


48 


48 


3 


192 


38 


30 


97 


84 


4 


768 


51 


39 


146 


120 


5 


3 072 


64 


48 


195 


156 



3 Estimation 

In this section, we expose an EM algorithm for the estimation of 
MTD models. Firstly, this procedure allows to maximize the like- 
lihood without assuming the independence of parameters ip and n 
and offers the convergence properties of an EM algorithm. Secondly, 
from a technical point of view, the EM algorithm does not require 
any trick to fulfill the constraints holding on the (y?,7r) parameters as 
Berchtold's algorithm does. We expose here our estimation method 
of the MTDx model ([T]) having a specific I s * order transition matrix 
for each lag. The method can easily be adapted for single matrix 
MTD models as well as for MTD models having different types of 
transition matrix for each lag. Detailed derivations of the formu- 
las for identical matrix MTD and MTD/ models are presented in 
appendix [Bl 

To estimate the transition probabilities {¥(i m ....ii] io)', i m , io £ y} 
of a mth-order MTDi model, we propose to compute an approxima- 
tion of one set of parameters = {fg,i^g)i<g< m which maximizes 
the likelihood. 

3.1 Introduction of a hidden process 

Our approach lies on a particular interpretation of the model. The 
definition of the MTDx model ([T]) is equivalent to a mixture of m 
hidden models where the random variable Y t is predicted by one 
of the m Markov chains 7r g with the corresponding probability f g . 
Indeed, the coefficients (v? 9 ) 9 =i,.., m define a probability measure on 
the finite set {1, m} since they satisfy the constraints (T5]) and ([3]). 
From now on, we consider a hidden state process 5*1, S n thcit 
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Figure 2: DAG dependency structure of a 2 nd order MTDi model. 



determines the way according to which the prediction is carried 
out. The hidden state variables {St}, taking values in the finite set 
S = {l,...,m}, are independent and identically distributed, with 
distribution 

Vt<n,VgeS, F{S t = g) = <p g . 

The MTDi model is then defined as a hidden variable model. The 
observed variable Y t depends on the current hidden state St and on 
the m previous variables Yt_i, ...,Y t - m - This dependency structure 
of the model is represented as a Directed Acyclic Graph (DAG) in 
Figure El The hidden value at one position indicates which of those 
previous variables of transition matrices are to be used to draw the 
current letter: conditional on the state St, the random variable Y t 
only depends on the variable Y t _ St : 

Vt > m,Wg e S, F(Y t = y t \Y*^ = y^l, S t = g) = n g {y t - g ,yt). 

So we carry out estimation in the MTD X models as estimation 
in a mixture model where the components of the mixture are m 
Markov chains, each one predicting the variable Y t from one of the 
m previous variables. 

3.2 EM algorithm 

By considering a hidden variables model, we want to compute the 
maximum likelihood estimate from incomplete data. The EM algo- 
rithm introduced by Dempster et al. [HJ is a very classical framework 
for achieving such a task. It has proved to be particularly efficient 
at estimating various classes of hidden variable models. We make it 
entirely explicit in the case of the MTD models. 

The purpose of the EM algorithm is to approximate the maxi- 
mum of the log-likelihood of the incomplete data L y (0) = log ¥e(Y = 
y) over 6 e using the relationship 

V0,0' G Q,L y {0) = Q{0\0') - H{0\0') 
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where the quantities Q and H are defined as follows 



Q{0\9') = E[\ogW (Y,S)\Y = y,O'} 
H(0\0') = E[\og¥ e (Y,S\Y = y)\y,6'] 

The EM algorithm is divided in two steps: E-step (Expectation) 
and M-step (Maximization). Both steps consist of, respectively, 
computing and maximizing the function Q(0\0^), that is the log- 
likelihood of the complete model conditional on the observed se- 
quence y and on the current parameter 0W. Using the fact that the 
function — ► H(0\0( k >) is maximal in 0( k \ Dempster et al. proved 
that this procedure necessarily increases the log-likelihood L y {0). 
See EJ for a detailed study of the convergence properties of the 



EM algorithm. 

We now derive analytical expressions for both E-step and M- 
step. In this particular case, the log-likelihood of the complete data 
{Y™ +1 , <S^+i) conditional on the first m observations Y^ m writes: 



n m 



iogp K +1 ,s^ +1 |ir)= E EEEw^^^-i) 

t=m+l g=l iey jey 



n m 



+ E E 1 ^} 1 ^^- ( 8 ) 

t=m+l g=l 

E-step The Estimation step is computing the expectation of this 
function (JS]) conditional on the observed data y and the current 
parameter 0^ k \ that is calculating, for alH > m and for all element 
g in {1, m}, the following quantity, 

E(l {Si=g} \y,0^)=F(S t = g\y,0^). (9) 

Then, function Q writes: 



n m 



t=m+l g=l i£_y jey 



n m 



+ E E p (^=^' 0(fe) ) lo s*v (io) 

t=m+l g=l 

So E-step reduces to computing the probabilities (jHJ), for which 
we derive an explicit expression by using the theory of graphical 
models in the particular case of DAG structured dependencies [§]. 
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Figure 3: Moral graph of a 2 nd order MTDi model. 



First, remark that the state variable St depends on the sequence Y 
only through the m + 1 variables {Y t - m , Yi_i, Y t }: 

Vt <n,\/ge {1, m}, P(5i = 0) = P(fii = g\Y t i m = y\_ w 6). 

(11) 

Indeed, independence properties can be derived from the moral 
graph (Fig. [3]) which is obtained from the DAG structure of the 
dependencies (Fig. by "marrying" the parents, that is adding an 
edge between the common parents of each variable, and then delet- 
ing directions. In this moral graph, the set {Yt_ m , ...,Y t } separates 
the variable S t from the rest of the sequence {Yi, Y t _ m _i} so that 
applying corollary 3.23 from [9i] yields: 

S t ±(Y 1 t - m -\Y t l 1 )\Yt m 

From now on, we denote = i m i m -\...i\%Q any (m + l)-letter 
word composed of elements of y. For all g in {1, ...,m}, for all 
elements of y, Bayes' Theorem gives: 

P( St =g\YU = %»0) 

F(S t = g,Y t = i0 \Y t t - 1 n = i 1 m ,6) 
¥{Y t = i Q \Y t t ^ = i] rO 0) 
F(Y t = io\S t = g, Y^ = C 6)F(S t = g\Y^ = C 8\ 

EE Wt = io\s t = I = <r e)nSt = v?ti = 

We show below that the probabilities P(Yj = io\S t = g,YtI m = 
iln,0) and f(S t = glY/z^ = in expression (TT2|) are entirely 

explicit. First, conditional on 6, the state S t and the variables Y*~^, 
the distribution of Y t writes: 

F(Y t =i \S t =g, Y££ = i) w 6) = n g (i g , i ). 

Second, although the state St depends on the (m + l)-letter word 
Y t t _ m , which brings information about the probability of transition 
from Y^m to Y t , it does not depend on the m- letter word formed by 
the only variables Yfj^. This again follows from the same corollary 
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Figure 4: In black: graph of the smallest ancestral set containing St and the 
two variables (Yj_2, Yt-l) in the particular case of a 2 nd order MTDi model. 
(The part of the structure dependency DAG that is excluded from the smallest 
ancestral set appears here in light blue.) 




Figure 5: Moral graph of the smallest ancestral set in Figure 2J There is no 
path between St and the subset of 2 variables {lt_2, Yt-i}- 

in [oj . The independence of the variables St and Y^Z m is derived from 
the graph of the smallest ancestral set containing these variables, 
that is the subgraph containing St, Y t t ~{ n and the whole line of their 
ancestors (See Figure H] for an illustration when n — 2). It turns out 
that, when considering the moralization of this subgraph (Figure 
|5J), there is no path between S t and the set Y-t-m- This establishes 
S t J- Y/Zm and we have 

¥(S t = g\Y£* = C 9) = ¥(S t = g\0) = tp g . 

Finally, the probability ( [121) . is entirely determined by the current 
parameter and does not depend on the time t. 

As a result, the k th iteration of Estimation-step consists in cal- 
culating, for all g in {1, m} and for all m + 1-letters word i° m of 
elements of y, 

Wg e {1, ...,m}, V i m , ...,ii,i e y, 

vf(g\ii)=ns t = 9\Yl m = iLoW) 

M-Step Maximization of the function Q(0\O^) with respect to the 
constraints imposed on the vector cp and on the elements of the tran- 
sition matrices 7r 1; 7r m is easily achieved using Lagrange method: 
e {l,...,m},Vz,j e y, 



(k) (fc), 

<f y g J lT y g '(I 



m (k) (k) 



Em I 
1=1 <Pl 



7T. 



(13) 
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(fc+1) 



E V {k) (g\i JN(i° m ) 



(14) 




E, m .., g+lig _,., li0 P (fc H^ie l u°_JAr(e 1 u; 




-15) 



where sums are carried out for the variables i m , i g+ i, z g _i, ...,ii,i 
taking values in y, n is the length of the observed sequence y and 
N{i° m ) the number of occurrences of the word i° m in this sequence. 

Initialization To maximize the chance of reaching the global max- 
imum, we run the algorithm from various starting points. One ini- 
tialization is derived from contingency tables between each lag y t - g 
and the present y t as proposed by Berchtold ^ and several others 
are randomly drawn from the uniform distribution. 

EM- Algorithm for MTD models 

• Compute the number of occurrences of each (m+1) -letters word 



• initialize parameters (cp(°' , 7r(°') , 

• choose a stopping rule, i. e. an upper threshold e on the increase 
of the log-likelihood, 

• iterate E and M steps given by equations [Wp4fT5\) . 

• stop when L y (0( k+ V) - L y (0^) < e. 

A software implementation of our algorithm is available in the 
library seq++ at http : / / stat . genopole . cnrs . f r/seqpp. 

4 Applications 

4.1 Comparison with Berchtold's Estimation 

In this paper, we focus on estimation of the MTD; model (see Defini- 
tion [2]) which has a specific but same order matrix transition for each 
lag. We evaluate the performance of the EM algorithm with compar- 
ison to the last and best algorithm to date, developed by Berchtold 
0]. Among others, Berchtold estimates the parameters of MTD; 
models on two sequences analyzed in previous articles: a time serie 
of the twilight song of the wood pewee and the mouse aA-Crystallin 
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Table 2: Maximum log-likelihood of MTDi models estimated by EM and Berch- 
told's algorithm (see |2(, section 5.1 and 6.2). 



Order m 


g=\y\ 


Bcrchtold 


EM 


Sequence 


2 


3 


-486.4 


-481.8 


Pewee 




4 


-1720.1 


-1718.5 


aA-Crystallin 


3 


3 


-484.0 


-480.0 


Pewee 




4 


-1710.6 


-1707.9 


aA-Crystallin 



Gene sequence (the complete sequences appear in [12[ , Tables 7 and 
12). The song of the wood pewee is a sequence composed of 3 dis- 
tinct phrases (referred to as 1, 2, 3), whereas the aA-Crystallin Gene 
is composed of 4 nucleotides: a, c, g, t. 

We apply our estimation method to these sequences and obtain 
comparable or higher value of the log-likelihood for both (see Tab. 

Since the original parametrization of the MTDi model is not 
injective, it is not reasonable to compare their values. To overcome 
this problem, we computed the parameters from the set 0^ defined 
in (JTj). The estimated parameters (using a precision parameter s = 
0.001) of the 2 nd order MTDi model on the song of wood Pewee 
(first line of the Table [2]) are exposed in Figure EJ Complete results 
appear in appendix [C| namely estimated parameters <p,7ti,7t2 and 
their corresponding full 2 nd order transition matrices II. 

For both sequences under study, Pewee and aA-crystallin, EM 
and Berchtold algorithms lead to comparable estimations. The EM 
algorithm proves here to be an effective method to maximize the log- 
likelihood of MTD models. Nevertheless, EM algorithm offers the 
advantage to be very easy to use. Whereas Berchtold's algorithm 
requires to set and update a parameter 5 to alter the vector ip and 
each row of the matrices w g , running the EM algorithm only requires 
the choice of the threshold e in the stopping rule. 

4.2 Estimation on DNA coding sequences 

DNA coding regions are translated into proteins with respect to the 
genetic code, which is defined on blocks of three nucleotides called 
codons. Hence, the nucleotides in these regions are constrained in 
different ways according to their position in the codon. It is common 
in bioinformatics to use three different transition matrices to predict 
the nucleotides in the three positions of the codons. This model is 
called the phased Markov model. 

Since we aim at comparing the goodness-of-fits of models with 
different dimensions, the maximal value of a penalized likelihood 
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Figure 6: Estimation of a 2 nd order MTDi model on the song of the wood 
pewee. We use u=l (song n°l) as reference letter to express the parameters 
defined in Q. 

Estimates obtained with: 
• Berchtold's algorithm (L y (§) = -486.4): 

/ 0.754169 

[Pi(l;*.j')]i< i ,i<3 = 0.991696 
\ 0.993579 

/ 0.754169 

[Pi(2;i,j)] 1 < ili < 3 = 0.137205 
\ 0.048023 



0.198791 0.073356 \ 

0. 0.03462 

0.003497 0.02924 / 

0.198791 0.073356 \ 

0.213411 0.649384 

0.927598 0.044116 / 



EM-algorithm (L y (0) = -481.8): 

/ 0.75305 0.200475 0.046475 

[Pi(l;i,j)] 1 < ii < 3 = 0.991475 0. 0.008525 

\ 0.996425 0.003575 0. 

/ 0.75305 0.200475 0.046475 

\pi{2]i,j)] 1<ij<3 = 0.137525 0.21135 0.651125 

\ 0.02805 0.925475 0.046475 
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Figure 7: Difference according to the BIC criterion between MTD modeis and 
the corresponding fully parametrized Markov Model. 



function against the dimension of parameter space will be used to 
assess each model. The Bayesian Information criterion 13j for this 
evaluation is defined as: 



BIC(M) 



-2L y {0 M ) + d{M)logn, 



where 0m stands for the maximum likelihood estimate of model Ai. 
The lower the BIC a model achieves, the more pertinent it is. 

BIC evaluation has been computed on DNA coding sequence sets 
from bacterial genomes. Each of these sequence sets has length 
ranging from 1 500 000 to 5 000 000. Displayed values in Figure [7| 
are averages over the 15 sequences set of the difference between the 
BIC value achieved by the full Markov model and the one achieved 
by the MTD model of the same order. Whenever this figure is 
positive, the MTD model has to be preferred to the full Markov 
model. 

The full Markov model turns out to outperform the MTDi model 
when the order is inferior to 4. This is not surprising since the 
estimation is computed over large datasets that provide a sufficient 
amount of information with respect to the number of parameters 
of the full model. However, the 5 th order MTD X model and full 
Markov model have comparable performances, and the MTD X model 
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outperforms the full Markov model for higher orders. This is an 
evidence that although MTDi only approximate the full Markov 
models, their estimation accuracy decreases slower with the order. 

Even more striking is the comparison of the MTD2 model with 
the full Markov model. Whatever the order of the model, its goodness- 
of-fit is at least equivalent to the one achieved by the full Markov 
model. The MTD; model turns out to be a satisfactory trade-off 
between dimension and estimation accuracy. 
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Example of equivalent parameters defining the 
same MTDi model 



Let the size state space be 4 as for DNA sequences y 
these two 2 nd order MTDi model parameters 0,0'. 



{a, c, g, t} and consider 



ip= (0.3,0.7) tt\ = 



<p' = (0.2,0.8) tt[ 



( 0.1 0.2 0.3 0.4 ^ 

0.4 0.3 0.2 0.1 

0.2 0.2 0.2 0.4 

\ 0.4 0.2 0.2 0.2 j 



0.2 
0.65 
0.35 
\ 0.65 



0.1 
0.25 
0.1 
0.1 



0.2 
0.05 
0.05 
0.05 



0.5 
0.05 
0.5 
0.2 



TVn 



7T2 



/ 0.1 0.1 

0.2 0.2 

0.3 0.3 

\ 0.3 0.2 



0.075 
0.1625 
0.25 
0.25 



0.1375 
0.225 
0.3125 
0.225 



0.1 
0.4 
0.3 
0.3 



0.7 \ 
0.2 
0.1 
0.2 J 



0.15 
0.4125 
0.325 
0.325 



0.6375 \ 

0.2 
0.1125 

0.2 J 



Both parameters define the same 2 nd order Markov transition matrix II. 
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n = 







a 


c 


Q 


t 


aa 


l 


0.1 


0.13 


0.16 


0.61 \ 


ac 




0.19 


0.16 


0.13 


0.52 


aa 




0.13 


0.13 


0.13 


0.61 


at 




0.19 


0.13 


0.13 


0.55 


ca 




0.17 


0.2 


0.37 


0.26 


cc 




0.26 


0.23 


0.34 


0.17 


ca 




0.2 


0.2 


0.34 


0.26 


ct 




0.26 


0.2 


0.34 


0.2 


ga 




0.24 


0.27 


0.3 


0.19 


gc 




0.33 


0.3 


0.27 


0.1 


99 




0.27 


0.27 


0.27 


0.19 


St 




0.33 


0.27 


0.27 


0.13 


ta 




0.24 


0.2 


0.3 


0.26 


tc 




0.33 


0.23 


0.27 


0.17 


tg 




0.27 


0.2 


0.27 


0.26 


tt 


\ 


0.33 


0.2 


0.27 


0.2 J 



B EM algorithm for other MTD models 

B.l Single matrix MTD model: iteration k. 
E-Step Vg £ {1, ...,m},V i m , ...,h,io e {1, ...,q}, 

v (k) (nU o x _ <Pg k) n lk) (ig,io) 

r S WmJ ~ m (fe) , • 



M-Step Vg e {1, m}, Vi, j £ {!,...,«}, 



tm->0 

E;,E Im ... l5+ll! „,.,p (fc) (9i4 +i «;-ij)Jv(e i ! i 1 j) 



E™ i E im .., +lis .,, 10 ^(ffl^tfS-OM^X-i) 



where sums are carried out for the variables i m , i g +i, i g -i, i\, io varying 
from 1 to q, n is the length of the observed sequence y and iV(i^) the number 
of occurrences of the word in this sequence. 

B.2 MTD| model: iteration k. 

E-Step V.g e {1, ...,m- I + l},Vi m , ...ii,i e {1, •••,?}, 



(fe) (k),.g . \ 

,(fc)^|..0 ^ _ ^ ^ (Vi-1^0) 



n k \g\i° m ) 



y->m-( + ± (K (K ,-h ■ \ 
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M-Step Vg £ {1, ...,m},Vij, ...,h, j £ {!,...,£?}, 



U m . . .Uq 



( * + i) f . . ... E Um ... UB+ , Ug _ 1 ... ul PL fc) (gK + '^ 1 ^- 1 j)iv« + 'i i 1 ^- 1 j) 

where sums are carried out for the variables u rn , u 9 -i, ...,ui,uo varying 
from 1 to q, n is the length of the observed sequence y and iV(i^) i/ie number 
of occurrences of the word in this sequence. 

C 2 nd order MTDi estimates obtained on both 
the song of wood pewee and the mouse aA- 
Crystallin Gene sequence (Section |4JJ). 

1. Song of wood pewee 

Berchtold's algorithm (see 0, section 5.1): L y {0) = -486.4. 



0.097 0.739 0.164 \ / 0.996 0.004 

tp = (0.269,0.731) 7h=| 0.980 0.020 tt 2 = 0.152 0.020 0.828 

0.987 0.013 / \ 0.003 0.997 



EM-algorithm: L y (0) = -481. 



0.102 0.729 0.169 \ / 1 

<p = (0.275,0.725) tti = | 0.969 0.031 tt 2 = 0.151 0.015 0.834 

0.987 0.013 / \ 1 

These estimated parameters define respectively the following 2 nd order 
Markov transition matrices TIb and Hem- 



0.754169 
0.991696 
0.993579 
0.137205 
0.374732 
0.376615 
0.028286 
0.265813 
0.267696 



0.198791 
0. 

0.003497 

0.213411 

0.01462 

0.018117 

0.927598 

0.728807 

0.732304 



0.047040 

0.008304 

0.02924 

0.649384 

0.610648 

0.605268 

0.044116 

0.00538 

0. 



n £ 



/ 0.75305 
0.991475 
0.996425 
0.137525 
0.37595 
0.3809 
0.02805 
0.266475 
\ 0.271425 



0.200475 
0. 

0.003575 

0.21135 

0.010875 

0.01445 

0.925475 

0.725 

0.728575 



0.046475 
0.008525 
0. 

0.651125 

0.613175 

0.60465 

0.046475 

0.008525 

0. 
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2. Mouse aA-Crystallin Gene sequence 

EM-algorithm: L y (6) = -1718.5. 

ip = (0.562,0.438), 

0.600 0.149 0.157 \ 

0.271 0.153 0.241 

0.415 0.099 0.301 

0.370 0.129 0.309 / 

These estimated parameters define respectively the following 2 nd order 
Markov transition matrix Hem- 






167622 


0.341480 





349634 





141264 







240120 


0.431400 





069758 





258722 





193474 


0.331926 





321534 





153066 







134464 


0.370142 





306922 





188472 







273180 


0.197378 





351386 





178056 







345678 


0.287298 





071510 





295514 







299032 


0.187824 





323286 





189858 







240022 


0.226040 





308674 





225264 







207480 


0.260450 





327734 





204336 







279978 


0.350370 





047858 





321794 







233332 


0.250896 





299634 





216138 







174322 


0.289112 





285022 





251544 







210546 


0.240740 





340874 





207840 







283044 


0.330660 





060998 





325298 







236398 


0.231186 





312774 





219642 







177388 


0.269402 





298162 





255048 / 



No detail on the 2 nd order MTDi estimates from the mouse aA-Crystallin 
Gene sequence is given in 



/ 0.225 0.140 0.506 0.129 \ 

0.354 0.300 0.008 0.338 

0.271 0.123 0.456 0.150 

V 0.166 0.191 0.430 0.213 / 



7T2 



/ 0.094 
0.335 
0.185 

I 0.192 
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